Abstract

We propose an abstraction-based multi-document summarization framework that can construct new sentences by exploring more fine-grained syntactic units than sentences, namely, noun/verb phrases. Different from existing abstraction-based approaches, our method first constructs a pool of concepts and facts represented by phrases from the input documents. Then new sentences are generated by selecting and merging informative phrases to maximize the salience of phrases while satisfying the sentence construction constraints. We employ integer linear optimization to conduct phrase selection and merging simultaneously in order to achieve the globally optimal solution for a summary. Experimental results on the benchmark data set TAC 2011 show that our framework outperforms the state-of-the-art models under the automated pyramid evaluation metric, and achieves reasonably good results on manual linguistic quality evaluation.

1 Introduction 

Existing multi-document summarization (MDS) methods fall into three categories: extraction-based, compression-based and abstraction-based. Most summarization systems adopt the extraction-based approach, which selects some original sentences from the source documents to create a short summary (Erkan and Radev, 2004; Wan et al., 2007). However, the restriction that a whole sentence must be selected potentially yields overlapping information in the summary. To this end, some researchers apply compression to the selected sentences by deleting words or phrases (Knight and Marcu, 2000; Lin, 2003; Zajic et al., 2006; Harabagiu and Lacatusu, 2010; Li et al., 2015), which is the compression-based method. Yet, these compressive summarization models cannot merge facts from different source sentences, because all the words in a summary sentence come from a single source sentence.

* The work described in this paper is substantially supported by grants from the Research and Development Grant of Huawei Technologies Co. Ltd (YB2013090068/TH138232) and the Research Grant Council of the Hong Kong Special Administrative Region, China (Project Codes: 413510 and 14203414).

The work was done when Weiwei Guo was in Columbia University.

In fact, previous investigations show that human-written summaries are more abstractive, which can be regarded as a result of sentence aggregation and fusion (Cheung and Penn, 2013; Jing and McKeown, 2000). Some works, albeit less popular, have studied the abstraction-based approach, which can construct a sentence whose fragments come from different source sentences. One important work, developed by Barzilay and McKeown (2005), employed sentence fusion, and was followed by (Filippova and Strube, 2008; Filippova, 2010). These works first conduct clustering on sentences to compute the salience of topical themes. Then, sentence fusion is applied within each cluster of related sentences to generate a new sentence containing the common information units of the sentences. Abstraction-based approaches gather information across sentence boundaries, and hence have the potential to cover more content in a more concise manner.

In this paper, we propose an abstractive MDS framework that can construct new sentences by exploring more fine-grained syntactic units than sentences, namely, noun/verb phrases (NPs/VPs). This idea is based on two observations. First, the major constituent phrases loosely correspond to concepts and facts. After reading a set of documents describing the same topic or event, a person digests these documents as key concepts and facts in his/her mind, such as “an armed man” and “walked into an Amish school” from Figure 1. Second, a summary writer re-organizes the key concepts and facts to form new sentences for the summary. Accordingly, our proposed framework has two major components corresponding to the above observations. The first component creates a pool of concepts and facts represented by NPs and VPs from the input documents. A salience score is computed for each phrase by exploiting redundancy of the document content in a global manner. The second component constructs new sentences by selecting and merging phrases based on their salience scores, and ensures the validity of new sentences using an integer linear optimization model.

Figure 1: The constituency tree of a sentence from a news document.


The contribution of this paper is twofold. (1) We extract NPs/VPs from constituency trees to represent key concepts/facts, and merge them to construct new sentences, which allows more summary content units (SCUs) (Nenkova and Passonneau, 2004) to be included in a sentence by breaking the original sentence boundaries. (2) The designed optimization framework for addressing the problem is unique and effective. Our optimization algorithm simultaneously selects and merges a set of phrases that maximize the number of covered SCUs in a summary. Meanwhile, since the basic units are phrases, we design compatibility relations among NPs and VPs, as well as other optimization constraints, to ensure that the generated sentences contain correct facts. Compared with the sentence fusion approaches that compute salience scores of sentence clusters, our proposed framework explores a more fine-grained textual unit (i.e., phrases), and maximizes the salience of selected phrases in a global manner.

2 Description of Our Framework 

We first introduce how to extract NPs and VPs from constituency trees, and subsequently calculate salience scores for them. Then we formulate the sentence generation task as an optimization problem, and design constraints. In the end, we perform several post-processing steps to improve the order and the readability of the generated sentences.


2.1 Phrase Salience Calculation 


The first component decomposes the sentences in documents into a set of noun phrases (NPs) derived from the subject parts of a constituency tree and a set of verb-object phrases (VPs), representing potential key concepts and key facts, respectively. These phrases will serve as the basic elements for sentence generation.


We employ the Stanford parser (Klein and Manning, 2003) to obtain a constituency tree for each input sentence. After that, we extract NPs and VPs from the tree as follows: (1) The NPs and VPs that are the direct children of the sentence node (represented by the S node) are extracted. (2) VPs (NPs) in a path on which all the nodes are VPs (NPs) are also recursively extracted and regarded as having the same parent node S. The recursive operation in the second step is only carried out for two levels, since the phrases in the lower levels may not be able to convey a complete fact. Take the tree in Figure 1 as an example: the corresponding sentence is decomposed into the phrases “An armed man”, “walked into an Amish school, sent the boys outside and tied up and shot the girls, killing three of them”, “walked into an Amish school”, “sent the boys outside”, and “tied up and shot the girls, killing three of them”. Because of the recursive operation, the extracted phrases may overlap. Later, we will show how to avoid such overlapping in phrase selection.
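The two extraction rules can be sketched on a toy tree as follows. This is a minimal illustration, not the authors' implementation: the nested-tuple tree format, the names `phrase_text` and `extract_phrases`, and the `max_depth` parameter are assumptions made for this example, and phrase-level leaves stand in for the word-level output of a real parser. The sketch reads "two levels" as two levels below the direct child of S, and it recurses only when a phrase has more than one parallel sub-phrase with the same label, which is one possible interpretation of the rule.

```python
def phrase_text(node):
    """Concatenate the leaves under a node (leaves are plain strings)."""
    if isinstance(node, str):
        return node
    joined = " ".join(phrase_text(child) for child in node[1:])
    return joined.replace(" ,", ",")  # re-attach commas that were separate tokens


def extract_phrases(s_node, max_depth=2):
    """Return (NPs, VPs) extracted from one S node given as (label, child, ...)."""
    found = {"NP": [], "VP": []}

    def descend(node, label, depth):
        found[label].append(phrase_text(node))
        if depth >= max_depth:
            return
        # Rule 2: follow a path whose nodes all carry the same label,
        # but only when there are several parallel sub-phrases (assumption).
        subs = [c for c in node[1:] if not isinstance(c, str) and c[0] == label]
        if len(subs) > 1:
            for sub in subs:
                descend(sub, label, depth + 1)

    # Rule 1: NPs and VPs that are direct children of S.
    for child in s_node[1:]:
        if not isinstance(child, str) and child[0] in found:
            descend(child, child[0], 0)
    return found["NP"], found["VP"]


# Toy version of the example sentence discussed above.
tree = ("S",
        ("NP", "An armed man"),
        ("VP",
         ("VP", "walked into an Amish school"), ",",
         ("VP", "sent the boys outside"), "and",
         ("VP", "tied up and shot the girls, killing three of them")))

nps, vps = extract_phrases(tree)
```

On this toy tree the sketch yields one NP and four VPs (the full coordinated VP plus its three parallel sub-VPs), matching the decomposition listed in the text.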


* We only consider the recursive operation for a VP with more than one parallel sub-VP, such as the highest VP in Figure 1. The sub-VPs following modal, link or auxiliary verbs are not extracted as individual VPs. In addition, we also extract the clauses functioning as subjects of sentences as NPs, such as “that clause”. Note that we also refer to such clauses as “noun phrases” although their syntactic labels could be “SBAR” or “S”.

A salience score is calculated for each phrase to indicate its importance. Different types of salience can be incorporated in our framework, such as position-based method (Yih et al., 2007), statistical feature based method (Woodsend and Lapata, 2012), concept-based method (Li et al., 2011), etc. One key characteristic of our approach is that the considered basic units are phrases instead of sentences. Such finer granularity leaves more room for a better global salience score by potentially covering more distinct facts. In our implementation, we adopt a concept-based weight incorporating the position information. The concept set is designated to be the union set of unigrams, bigrams, and named entities in the documents. We remove stopwords and perform lemmatization before extracting unigrams and bigrams. The position-based term frequency is used in the concept weighting scheme. When counting the frequency, each occurrence of a concept in an input document is weighted with the paragraph position. A weight larger than 1 is given to the concept occurrences in the first few paragraphs. Specifically, the weight of the first paragraph is $B$ and the weight decreases as the position of the paragraph increases from the beginning of the document. The weighting function is:

\[
f(p) =
\begin{cases}
\rho^{p} \cdot B & \text{if } p < -(\log B / \log \rho) \\
1 & \text{otherwise}
\end{cases}
\qquad (1)
\]

where $p$ is the position of the paragraph, starting from 0 at the beginning of the document, and $\rho$ is a positive constant smaller than 1. Then, the salience of a phrase is calculated as the summed weights of its concepts.
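The position-weighted concept counting and the phrase salience can be sketched as below. This is an illustrative sketch under stated assumptions: the function names, the input layout (a document as a list of paragraphs, each a list of already-extracted concepts), and the choice to count each distinct concept of a phrase once are all assumptions of this example. The defaults B = 6 and rho = 0.5 are the tuned values reported in Section 3.1.

```python
import math
from collections import Counter


def paragraph_weight(p, B=6.0, rho=0.5):
    """Equation (1): weight of a concept occurrence in paragraph p (0-based)."""
    cutoff = -math.log(B) / math.log(rho)
    return (rho ** p) * B if p < cutoff else 1.0


def concept_weights(documents, B=6.0, rho=0.5):
    """Position-based term frequency over all documents of a topic.

    documents: list of documents; each document is a list of paragraphs;
    each paragraph is a list of concepts (hypothetical input layout).
    """
    weights = Counter()
    for doc in documents:
        for p, paragraph in enumerate(doc):
            w = paragraph_weight(p, B, rho)
            for concept in paragraph:
                weights[concept] += w
    return weights


def phrase_salience(phrase_concepts, weights):
    """Summed weights of the distinct concepts in a phrase (assumption: distinct)."""
    return sum(weights[c] for c in set(phrase_concepts))


# Toy topic with two short documents (made-up concepts).
docs = [
    [["gunman", "school"], ["police"]],
    [["gunman"], ["school"], ["police"], ["family"]],
]
weights = concept_weights(docs)
salience = phrase_salience(["gunman", "school"], weights)
```

With these defaults the paragraph weights are 6, 3, and 1.5 for the first three paragraphs and 1 from the fourth paragraph on, so only the opening paragraphs receive a boost.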


2.2 New Sentence Construction Model 

The construction of new sentences is formulated as an optimization problem which is able to simultaneously generate a group of sentences. Each new sentence is composed of one NP and at least one VP, where the NP and VPs may come from different source sentences. In the process of new sentence generation, the compatibility relation between NP and VP and a variety of summarization requirements are jointly considered.


2.2.1 Compatibility Relation 

The compatibility relation is designed to indicate whether an NP and a VP can be used to form a new sentence. For example, the NP “Police” from another sentence should not be the subject of the VP “sent the boys outside” extracted from Figure 1. We use some heuristics to find compatibility, and then expand the compatibility relation to more phrases by extracting coreference.

To find coreferent NPs (different mentions of the same entity), we first conduct coreference resolution for each document with the Stanford coreference resolution package (Lee et al., 2013). We adopt those resolution rules that are able to achieve high quality and address our need for summarization. In particular, Sieves 1, 2, 3, 4, 5, 9, and 10 in the package are used. A set of clusters is obtained, and each cluster contains the mentions that refer to the same entity in a document. The clusters from different documents in the same topic are merged by matching the named entities. After merging, the mentions that are not NPs extracted in the phrase extraction step are removed from each cluster. Two NPs in the same cluster are determined to be alternatives of each other.

To find alternative VPs, the Jaccard Index is employed as the similarity measure. Specifically, each VP is represented as a set of its concepts and the index value is calculated for each pair of VPs. If the value is larger than a threshold, the two VPs are determined to be alternatives of each other.
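The VP alternative test can be sketched as follows. The function names and the dictionary-based input are assumptions of this example; the default threshold 0.75 is the tuned value reported in Section 3.1, and the concept sets are made up for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard Index of two concept collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def alternative_vps(vp_concepts, threshold=0.75):
    """Map each VP id to the set of VP ids considered its alternatives.

    vp_concepts: dict from a VP id to its concepts (hypothetical layout).
    """
    alts = {vp: set() for vp in vp_concepts}
    for u, v in combinations(vp_concepts, 2):
        if jaccard(vp_concepts[u], vp_concepts[v]) > threshold:
            alts[u].add(v)
            alts[v].add(u)
    return alts


vp_concepts = {
    "V1": {"shoot", "girl", "kill", "three", "amish"},
    "V2": {"shoot", "girl", "kill", "three"},
    "V3": {"leave", "note", "family"},
}
alts = alternative_vps(vp_concepts)
```

Here V1 and V2 share four of five concepts (index 0.8), so they become alternatives of each other, while V3 has no alternative.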








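We then define an indicator matrix $\Gamma_{|N| \times |V|}$, in which $\Gamma[i, j] = 1$ if an NP $N_i$ and a VP $V_j$ come from the same node S in the constituency tree; otherwise, $\Gamma[i, j] = 0$. Let $\tilde{N}_i$ and $\tilde{V}_i$ represent the alternative phrases of $N_i$ and $V_i$ as described above. The compatibility matrix $\tilde{\Gamma}_{|N| \times |V|}$ is defined as follows:

\[
\tilde{\Gamma}[p, q] =
\begin{cases}
1 & \text{if } N_p \in \tilde{N}_i \wedge \Gamma[i, q] = 1 \\
1 & \text{if } V_q \in \tilde{V}_j \wedge \Gamma[p, j] = 1 \\
1 & \text{if } \Gamma[p, q] = 1 \\
0 & \text{otherwise}
\end{cases}
\qquad (2)
\]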


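Notation                                  Description
----------------------------------------  ------------------------------------------------------------
$N_i$, $V_i$                              Noun phrase $i$ and verb phrase $i$
$\alpha_i$, $\beta_i$                     Selection indicators of $N_i$ and $V_i$
$\alpha_{ij}$, $\beta_{ij}$               Co-occurrence indicators of pairs $(N_i, N_j)$ and $(V_i, V_j)$
$S_i^N$, $S_i^V$                          Salience scores of $N_i$ and $V_i$
$R_{ij}^N$, $R_{ij}^V$                    Similarity of pair $(N_i, N_j)$ and pair $(V_i, V_j)$
$\Gamma_{|N| \times |V|}$                 $\Gamma[i, j] = 1$ if $N_i$ and $V_j$ are from the same sentence
$\tilde{N}_i$, $\tilde{V}_i$              The alternative phrases of $N_i$ and $V_i$
$\tilde{\Gamma}_{|N| \times |V|}$         $\tilde{\Gamma}[i, j] = 1$ means $N_i$ and $V_j$ are compatible for being used to construct a new sentence
$\gamma_{ij}$                             Sentence generation indicator for $N_i$ and $V_j$ if $\tilde{\Gamma}[i, j] = 1$

Table 1: Notations.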


where $\tilde{\Gamma}[p, q] = 1$ means that $N_p$ and $V_q$ are compatible, i.e., permitted to be combined when constructing a new sentence. $\tilde{\Gamma}$ is the final compatibility matrix that we use in the optimization. The first case of Equation 2 implies that if $N_p$ and $N_i$ are coreferent, $N_p$ can replace $N_i$ and serve as the subject of $N_i$'s VP (i.e., $V_q$). The second case implies that if $V_q$ is very similar to $V_j$, $V_q$ can be concatenated to $V_j$'s NP (i.e., $N_p$).
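The expansion of the same-node indicator into the final compatibility relation can be sketched as below. Representing both matrices as sets of (NP, VP) pairs, the function name, and the phrase ids are assumptions of this example rather than the paper's data structures. The sketch applies each case once, without chaining alternatives transitively.

```python
def expand_compatibility(same_node, np_alts, vp_alts):
    """Build the final compatibility relation from the same-node indicator.

    same_node: set of (np_id, vp_id) pairs that share an S node.
    np_alts / vp_alts: dict from a phrase id to the set of its alternatives.
    """
    compat = set(same_node)                  # third case: the pair shares an S node
    for i, q in same_node:
        for p in np_alts.get(i, ()):         # first case: N_p is an alternative of N_i
            compat.add((p, q))
    for p, j in same_node:
        for q in vp_alts.get(j, ()):         # second case: V_q is an alternative of V_j
            compat.add((p, q))
    return compat


# Hypothetical ids: two coreferent NPs and one unrelated NP.
same_node = {("N84", "V84"), ("N85", "V85"), ("Police", "V149")}
np_alts = {"N84": {"N85"}, "N85": {"N84"}}
vp_alts = {}
compat = expand_compatibility(same_node, np_alts, vp_alts)
```

In this toy case the coreferent NPs N84 and N85 may each take the other's VP, while “Police” stays compatible only with its own VP.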

2.2.2 Phrase-based Content Optimization

The overall objective function of our optimization formulation to select NPs and VPs is defined as:

\[
\max \Big\{ \sum_{i} \alpha_i S_i^N - \sum_{i<j} \alpha_{ij} (S_i^N + S_j^N) R_{ij}^N + \sum_{i} \beta_i S_i^V - \sum_{i<j} \beta_{ij} (S_i^V + S_j^V) R_{ij}^V \Big\}, \qquad (3)
\]

where $\alpha_i$ and $\beta_i$ are selection indicators for the NP $N_i$ and the VP $V_i$, respectively. $S_i^N$ and $S_i^V$ are the salience scores of $N_i$ and $V_i$. $\alpha_{ij}$ and $\beta_{ij}$ are co-occurrence indicators of pairs $(N_i, N_j)$ and $(V_i, V_j)$. $R_{ij}^N$ and $R_{ij}^V$ are the similarity of pairs $(N_i, N_j)$ and $(V_i, V_j)$. If $N_i$ and $N_j$ are coreferent, $R_{ij}^N = 1$. Otherwise, the similarity is calculated with the above Jaccard Index based method. The notations are summarized in Table 1.

Specifically, we maximize the salience score of the selected NPs and VPs as indicated by the first and the third terms in Equation 3, and penalize the selection of similar NP pairs and similar VP pairs as indicated by the second and the fourth terms. Meanwhile, the phrase selection is governed by a set of constraints so that the selected phrases can generate valid sentences. The constraints will be explained in detail in Section 2.2.3.

One characteristic of our objective function is that NPs and VPs are treated differently, i.e., there are different selection/penalty terms for NP and VP. Such a design enables us to avoid a false penalty between an NP and a VP. For example, suppose the algorithm produces two sentences: the first sentence is “the gunman shot ...” with an NP “the gunman”, and the other sentence has a VP “confirmed the gunman died”. Obviously, we should not penalize the redundancy between them, because mentioning the gunman is necessary in both sentences.

2.2.3 Sentence Generation Constraints

To summarize the related sentences in the documents, human writers usually merge the important facts in different VPs about the same entity into a single sentence, and omit the trivial facts. Also, the same entity is likely to be described by coreferent NPs. Therefore, in our approach, only one NP is selected and employed as the subject of the newly generated sentence, which is then concatenated with the merged facts (i.e., VPs). If the compatibility entry $\tilde{\Gamma}[i, j]$ for $N_i$ and $V_j$ is 1, we define a sentence generation indicator $\gamma_{ij}$ to indicate whether both $N_i$ and $V_j$ are selected to construct a new sentence in the summary.

We design the following groups of constraints to realize our aim of phrase selection and new sentence construction. The objective function and constraints are linear; therefore, the problem can be solved by existing Integer Linear Programming (ILP) solvers such as the simplex algorithm (Dantzig and Thapa, 1991).

NP validity. To maintain the consistency between the selection indicator $\alpha$ and the compatibility entry $\tilde{\Gamma}$ for NP $N_i$, we introduce two constraints as follows:

\[
\alpha_i \ge \gamma_{ij}, \ \forall i, j; \qquad \sum_{j} \gamma_{ij} \ge \alpha_i, \ \forall i. \qquad (4)
\]

These two constraints work together to ensure the valid assignment of $\alpha$ according to the compatibility entry $\tilde{\Gamma}$.

VP legality. Similarly, the following requirement guarantees the consistency between the selection indicator $\beta$ and the compatibility entry $\tilde{\Gamma}$ for a selected VP $V_j$:

\[
\sum_{i} \gamma_{ij} = \beta_j, \ \forall j. \qquad (5)
\]

The NP validity and VP legality constraints jointly ensure that the selected NPs and VPs are able to form new summary sentences according to the values of the sentence generation indicators.

Not i-within-i. Two phrases on the same path of a constituency tree cannot be chosen at the same time:

\[
\text{if } \exists\, V_k \rightsquigarrow V_j, \text{ then } \beta_k + \beta_j \le 1; \qquad \text{if } \exists\, N_k \rightsquigarrow N_j, \text{ then } \alpha_k + \alpha_j \le 1, \qquad (6)
\]

where $\rightsquigarrow$ denotes that the two phrases lie on the same path of the tree. For example, “walked into an Amish school, sent the boys outside and tied up and shot the girls, killing three of them” and “walked into an Amish school” cannot both be incorporated in the summary, because of the obvious redundancy.

Phrase co-occurrence. These constraints control the co-occurrence relation of NPs or VPs. For NPs, we introduce three constraints:

\[ \alpha_{ij} - \alpha_i \le 0, \qquad (7) \]
\[ \alpha_{ij} - \alpha_j \le 0, \qquad (8) \]
\[ \alpha_i + \alpha_j - \alpha_{ij} \le 1. \qquad (9) \]

Constraints 7 to 9 ensure a valid solution of NP selection. The first two constraints state that if the units $N_i$ and $N_j$ co-occur in the summary (i.e., $\alpha_{ij} = 1$), then we have to include them individually (i.e., $\alpha_i = 1$ and $\alpha_j = 1$). The third constraint is the inverse of the first two. Similarly, the constraints for VPs are as follows:

\[ \beta_{ij} - \beta_i \le 0, \qquad (10) \]
\[ \beta_{ij} - \beta_j \le 0, \qquad (11) \]
\[ \beta_i + \beta_j - \beta_{ij} \le 1. \qquad (12) \]

Sentence number. In abstractive summarization, we prefer not to generate many short sentences. This is controlled by:

\[ \sum_{i} \alpha_i \le K, \qquad (13) \]

where $K$ is the maximum number of sentences.

Short sentence avoidance. We do not select the VPs from very short sentences because a short sentence normally cannot convey a complete key fact (Woodsend and Lapata, 2012):

\[ \text{if } l(S) < M, \ V_i \in S, \text{ then } \beta_i = 0, \qquad (14) \]

where $M$ is the threshold of the sentence length.

Pronoun avoidance. We exclude the NPs that are pronouns from being selected as the subject of the new sentences. As previously observed (Woodsend and Lapata, 2012), pronouns are normally not used by human summary writers. This is because the summary is short and the narration relation of sentences is relatively simple, so that pronouns are not needed. Moreover, in an automatic summary, pronouns will cause ambiguity, especially when the sentence order is automatically determined. Therefore, we model the constraint as:

\[ \text{if } N_i \text{ is a pronoun, then } \alpha_i = 0. \qquad (15) \]

Length constraint. The overall length of the selected NPs and VPs is no larger than a limit $L$:

\[ \sum_{i} \{ l(N_i) \cdot \alpha_i \} + \sum_{j} \{ l(V_j) \cdot \beta_j \} \le L, \qquad (16) \]

where $l(\cdot)$ is the word-based length of a phrase.
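To make the interaction between the objective and the constraints concrete, the sketch below solves a tiny instance by exhaustive enumeration instead of an ILP solver. It is only an illustration under stated assumptions: the function name, the data layout, and all numbers are made up, short-sentence and pronoun filtering are assumed to have been applied beforehand, and brute force is feasible only for toy inputs. Each VP is either left out or attached to exactly one compatible NP, and an NP counts as selected exactly when at least one VP is attached to it, which is one way to satisfy the NP validity and VP legality requirements.

```python
from itertools import combinations, product


def solve_toy(np_sal, vp_sal, compat, np_sim, vp_sim, length, conflicts, K, L):
    """Exhaustively search VP-to-NP assignments and return (best score, assignment).

    np_sim / vp_sim: dict keyed by sorted id pairs; conflicts: phrase pairs on the
    same tree path; K: maximum number of sentences; L: word-length limit.
    """
    nps, vps = sorted(np_sal), sorted(vp_sal)
    # Each VP may stay unselected (None) or attach to one compatible NP.
    options = [[None] + [n for n in nps if (n, v) in compat] for v in vps]
    best_score, best_assign = float("-inf"), None
    for choice in product(*options):
        assign = {v: n for v, n in zip(vps, choice) if n is not None}
        sel_vps = sorted(assign)
        sel_nps = sorted(set(assign.values()))
        if len(sel_nps) > K:                                    # sentence number
            continue
        if sum(length[x] for x in sel_nps + sel_vps) > L:       # length constraint
            continue
        chosen = set(sel_nps) | set(sel_vps)
        if any(a in chosen and b in chosen for a, b in conflicts):  # not i-within-i
            continue
        # Salience of selected phrases minus the pairwise redundancy penalties.
        score = sum(np_sal[n] for n in sel_nps) + sum(vp_sal[v] for v in sel_vps)
        for a, b in combinations(sel_nps, 2):
            score -= (np_sal[a] + np_sal[b]) * np_sim.get((a, b), 0.0)
        for a, b in combinations(sel_vps, 2):
            score -= (vp_sal[a] + vp_sal[b]) * vp_sim.get((a, b), 0.0)
        if score > best_score:
            best_score, best_assign = score, assign
    return best_score, best_assign


# Made-up toy instance: V1 is the parent VP of V2, N1 and N2 are coreferent.
best_score, best_assign = solve_toy(
    np_sal={"N1": 5.0, "N2": 4.0},
    vp_sal={"V1": 6.0, "V2": 3.0, "V3": 5.5},
    compat={("N1", "V1"), ("N1", "V2"), ("N1", "V3"), ("N2", "V3")},
    np_sim={("N1", "N2"): 1.0},
    vp_sim={("V1", "V3"): 0.8},
    length={"N1": 3, "N2": 1, "V1": 6, "V2": 4, "V3": 6},
    conflicts=[("V1", "V2")],
    K=2,
    L=15,
)
```

In this instance the search attaches V2 and V3 to N1: pairing V1 with V3 is penalized for similarity, V1 and V2 conflict, and using both coreferent NPs is penalized as redundant.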


2.3 Postprocessing 

Recall that we require that one NP and at least one VP compose a sentence. Thus, we form a raw sentence with a selected NP as the subject, followed by the corresponding selected VPs, which are indicated by the sentence generation indicator $\gamma_{ij}$ having the value 1. The VPs in a summary sentence are ordered according to their natural order if they come from the same document. Otherwise, they are ordered according to the timestamps of the corresponding documents. After that, if the total length is smaller than $L$, we add conjunctions such as “and” and “then” to concatenate the VPs, improving the readability of the newly generated sentences. The pseudo-timestamp of a sentence is defined as the earliest timestamp of its VPs, and the sentences are ordered based on their pseudo-timestamps.
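The ordering steps can be sketched as follows. The function name, the input layout, and the timestamps are hypothetical; the sketch assumes distinct documents have distinct timestamps, always joins the last two VPs with “and”, and omits the check against the length limit for brevity.

```python
def build_sentences(assign, vp_info):
    """Assemble and order raw summary sentences.

    assign: dict from a VP id to the text of the NP it is attached to.
    vp_info: dict from a VP id to (text, document timestamp, position in document).
    """
    by_np = {}
    for vp, np_text in assign.items():
        by_np.setdefault(np_text, []).append(vp)

    sentences = []
    for np_text, vps in by_np.items():
        # Same document: natural order; different documents: timestamp order.
        vps.sort(key=lambda v: (vp_info[v][1], vp_info[v][2]))
        pseudo_ts = min(vp_info[v][1] for v in vps)  # earliest VP timestamp
        texts = [vp_info[v][0] for v in vps]
        if len(texts) == 1:
            body = texts[0]
        else:
            body = ", ".join(texts[:-1]) + " and " + texts[-1]
        sentences.append((pseudo_ts, f"{np_text} {body}."))

    sentences.sort(key=lambda s: s[0])
    return [text for _, text in sentences]


# Phrases taken from the case study; timestamps and positions are invented.
assign = {
    "V150": "Charles Carl Roberts IV",
    "V85": "Charles Carl Roberts IV",
    "V149": "Police",
}
vp_info = {
    "V85": ("killed himself as police stormed the building", 1, 1),
    "V150": ("left what they described as rambling notes for his family", 2, 7),
    "V149": ("could offer no explanation for the killings", 2, 6),
}
summary = build_sentences(assign, vp_info)
```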

2.4 Relation to Existing MDS Approaches 

Many existing extraction-based and compression-based MDS approaches could be regarded as special cases under our framework: (1) To simulate extraction-based summarization, we just need to constrain that the highest NP and the highest VP from the same sentence are selected simultaneously. In addition, no NPs and VPs in lower levels can be selected. Thus, the output only contains the original sentences of the source documents. (2) To simulate compression-based summarization, we can adapt our framework to conduct sentence selection and sentence compression in a joint manner. Specifically, we only need to restrict that the NP and VPs of a summary sentence must come from the same original sentence.


3 Experiments 

3.1 Experimental Setup 

The data set of the traditional summarization task in the Text Analysis Conference (TAC) 2011 is used to evaluate the performance of our approach. This data set is the latest one and it contains 44 topics. Each topic falls into one of 5 predefined event categories and contains 10 related news documents. Four writers wrote model summaries for each topic.

The data set of the traditional summarization task in TAC 2010 is employed as the development/tuning data set. This data set contains 46 topics from the same predefined categories. Each topic also has 10 documents and 4 model summaries.

Based on the tuning set, the key parameters of our model are set as follows. The constants $B$ and $\rho$ in the weighting function are set to 6 and 0.5 respectively. The similarity threshold in obtaining the alternative VPs is 0.75. We did not observe a significant difference between cosine similarity and the Jaccard Index.

We mainly evaluate the system by pyramid evaluation. To gain a comprehensive understanding, we also evaluate by ROUGE evaluation and manual linguistic quality evaluation.


3.2 Results with Pyramid Evaluation 


The pyramid evaluation metric (Nenkova and Passonneau, 2004) involves semantic matching of summary content units (SCUs) so as to recognize alternate realizations of the same meaning. Different weights are assigned to SCUs based on their frequency in model summaries. A weighted inventory of SCUs, named a pyramid, is created, which constitutes a resource for investigating alternate realizations of the same meaning. This property makes the pyramid method more suitable for evaluating summaries. Another widely used evaluation metric is ROUGE (Lin and Hovy, 2003), which evaluates summaries from a word-overlap perspective. Because of its strict string matching, it ignores the semantic content units and performs better when larger sets of model summaries are available. In contrast to ROUGE, pyramid scoring is robust with as few as four model summaries (Nenkova and Passonneau, 2004). Therefore, in recent summarization evaluation workshops such as TAC, the pyramid is used as the major metric.

System   Auto-pyr (Th: .6)   Auto-pyr (Th: .65)   Rank in TAC 2011
Our      0.905               0.793                NA
22       0.878               0.775                1
43       0.875               0.756                2
17       0.860               0.741                3

Table 2: Comparison with the top 3 systems in TAC 2011.

Since manual pyramid evaluation is time-consuming, and the exact evaluation scores are not reproducible, especially when the assessors for our results are different from those of TAC, we employ the automated version of pyramid proposed in (Passonneau et al., 2013). The automated pyramid scoring procedure relies on distributional semantics to assign SCUs to a target summary. Specifically, all n-grams within sentence bounds are extracted, and converted into 100-dimensional latent topical vectors via a weighted matrix factorization model (Guo and Diab, 2012). Similarly, the contributors and the label of an SCU are transformed into 100-dimensional vector representations. An SCU is assigned to a summary if there exists an n-gram such that the similarity score between the SCU low-dimensional vector and the n-gram low-dimensional vector exceeds a threshold. Passonneau et al. (2013) showed that the distributional similarity based method produces automated scores that correlate well with manual pyramid scores, yielding more accurate pyramid scores than string matching based automated methods (Harnly et al., 2003). In this paper, we adopt the same setting as in (Passonneau et al., 2013): a 100-dimensional matrix factorization model is learned on a domain-independent corpus, which is drawn from sense definitions of WordNet and Wiktionary*, and the Brown corpus. We experiment with 2 threshold values, i.e., 0.6 and 0.65, similar to those used in (Passonneau et al., 2013).

* http://en.wiktionary.org/

          ROUGE-2                  ROUGE-SU4
System    P      R      F1         P      R      F1
Our       0.117  0.117  0.117      0.148  0.147  0.148
22        0.112  0.114  0.113      0.147  0.150  0.148
43        0.132  0.135  0.134      0.162  0.166  0.164
17        0.128  0.131  0.129      0.157  0.160  0.159

Table 3: Performance under ROUGE metric.

The top three systems in TAC 2011 evaluated with manual pyramid score were System 22 (Li et al., 2011), 43, and 17 (Ng et al., 2011). Table 2 shows the comparison with them under the automated pyramid evaluation. Our method achieves the best results for both thresholds, which means that our method is able to find more summary content units (SCUs) than the state-of-the-art systems in TAC 2011. In addition, a paired t-test (with p < 0.01) comparing our model with the best system in TAC 2011, i.e., System 22, shows that the performance of our model is significantly better. It is worth noting that the three systems used additional external linguistic resources: System 22 used a Wikipedia corpus for providing domain knowledge, and Systems 17 and 43 defined some category-specific features. Without any domain adaptation, our framework can still achieve encouraging performance.

We calculate Pearson’s correlation to measure how well the automatic pyramid approximates the manual pyramid scores for 50 system submissions in TAC 2011. The values are 0.91 and 0.93 for thresholds 0.6 and 0.65 respectively. This demonstrates that the automated pyramid is reliable for differentiating the performance of different methods.
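For reference, Pearson's correlation between two lists of per-system scores can be computed as in the sketch below. The function name and the score lists are placeholders invented for this example; they are not the TAC 2011 scores.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Placeholder scores for four hypothetical systems.
manual_scores = [0.30, 0.42, 0.45, 0.51]
automated_scores = [0.60, 0.84, 0.90, 1.02]  # exactly twice the manual scores
r = pearson(manual_scores, automated_scores)
```

Because the placeholder automated scores are an exact linear function of the manual ones, the coefficient here is 1 up to floating-point error; real system scores would give a value below 1.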


3.3 Results with ROUGE Evaluation 

As mentioned above, we favor the pyramid evaluation over the ROUGE score because it can measure the summary quality beyond simple string matching. Here, we also provide ROUGE scores for reference. The ROUGE-1.5.5 package* is employed with the same parameters as in TAC. The results are summarized in Table 3. Our performance is slightly better than System 22, and it is not as good as Systems 43 and 17. The reason is that Systems 43 and 17 used category-specific features and trained the feature weights with the category information in TAC 2010 data. These features help them select better category-specific content for the summary. However, the usability of such features depends on the availability of predefined categories in the summarization task, as well as the availability of training data with the same predefined categories for estimating feature weights. Therefore, the adaptability of these methods is limited to some extent. In contrast, our framework does not define any category-specific feature and only uses TAC 2010 data to tune the parameters for general summarization purposes.

* http://www.berouge.com/Pages/default.aspx

3.4 Linguistic Quality Evaluation 

The linguistic quality of summaries is evaluated using the five linguistic quality questions on grammaticality (Q1), non-redundancy (Q2), referential clarity (Q3), focus (Q4), and coherence (Q5) in the Document Understanding Conferences (DUC). A Likert scale with five levels is employed, with 5 being very good and 1 being very poor. Each summary was blindly evaluated by three assessors on each question. System 22 performed better than Systems 43 and 17 in TAC 2011 on the evaluation of readability, which is an aggregation of the above questions. Considering the labor-intensive nature of manual assessment, we only conduct a comparison with System 22.

The results are given in Table 4. On average, the two systems perform very closely. System 22 is an extraction-based method that picks the original sentences, hence it achieves a higher score in Q1 grammaticality, while our approach has some new sentences with grammar mistakes, which is a common problem for abstractive methods and deserves more future research effort. For Q4 focus, our score is higher than System 22, which reveals that our summary sentences are relatively more cohesive. The score of Q3 referential clarity shows that the referential relation is basically clear in our summaries, even when new sentences are automatically generated. Setting the grammaticality scores aside, our system performs slightly better than System 22: the average scores of our system and System 22 on the last four questions are 3.37 and 3.33 respectively.

4 Qualitative Results 

4.1 Analysis of Summary Sentence Type 

There are three types of sentences in the summaries generated by our framework, namely, new sentences, compressed sentences, and original sentences. A new sentence is constructed by merging the phrases from different original sentences. A compressed sentence is generated by deleting phrases from an original sentence. An original sentence in the summary is directly extracted from the input documents.

System   Q1     Q2     Q3     Q4     Q5     AVG
Our      3.67   3.50   3.90   3.23   2.83   3.43
22       4.13   3.50   3.97   2.97   2.87   3.49

Table 4: Evaluation of linguistic quality.

The percentage of different types of sentences in our summaries is calculated. About 33% of the summary sentences are newly constructed. This demonstrates that our framework is capable of merging phrases from the original sentences so as to convey more information in compact summaries. In addition, about 44% of the summary sentences are generated by compression. This shows a unique characteristic of our framework: sentence construction and sentence compression are conducted in a unified model.

4.2 Case Study 

Table 5 shows the summary of the first topic, i.e., “Amish Shooting”, by our framework. The summary sentence ID and the sentence type are given in the form of “[summary sentence ID: sentence type]”. Each selected phrase and the original sentence ID where the phrase originated are given in the form of “{selected phrase (original sentence ID)}”. There are three compressed sentences with IDs 1, 2, and 4, one new sentence with ID 3, and two original sentences with IDs 5 and 6.

The new sentence is constructed from the following original sentences, in which the extracted NPs and VPs are indicated with colored parentheses:

(84): On Monday morning, (NP Charles Carl Roberts IV) (VP (VP entered the West Nickel Mines Amish School in Lancaster County) and (VP shot 10 girls), (VP killing five)).

(85): (NP Roberts) (VP killed himself as police stormed the building).

(150): (NP Roberts) (VP left what they described as rambling notes for his family).

The NPs of these sentences are coreferent so that some of their VPs are merged and concatenated with one NP, i.e., “Charles Carl Roberts IV”.

[1: C] {An armed man (25)} {walked into an Amish school (25)} {tied up and shot the girls, killing three of them. (25)}
[2: C] {A man who laid siege to a one-room Amish schoolhouse (64)} {told his wife shortly before opening fire that he had molested two young girls who were his relatives decades ago (64)} {was tormented by dreams of molesting again. (64)}
[3: N] {Charles Carl Roberts IV (84)} {killed himself as police stormed the building (85)} {left what they described as rambling notes for his family. (150)}
[4: C] {The gunman (145)} {was not Amish (145)} {had not attended the school. (145)}
[5: O] {The shootings (148)} {occurred about 10:45 a.m. (148)}
[6: O] {Police (149)} {could offer no explanation for the killings. (149)}

Table 5: The summary of “Amish Shooting” topic.

The summary sentences with IDs 1, 2, and 4 
are compressions from the following original sen¬ 
tences respectively: 

(25): (NP An armed man ) (VP (VP walked into 
an Amish school), (VP sent the boys outside) 
and (VP tied up and shot the girls, killing three 
of them) ) , (NP authorities) (VP said). 

(64): (NP (NP A man) who laid siege to a 
one-room Amish schoolhouse), (VP killing five 
girls), (VP (VP told his wife shortly before opening 
fire that he had molested two young girls who 
were his relatives decades ago) and (VP was 
tormented by “dreams of molesting again”)), (NP 
authorities) (VP said Tuesday). 

(145): According to media reports, (NP the 
gunman) (VP (VP was not Amish) and (VP had 
not attended the school ) ) . 


Some non-critical information is excluded from 
the summary sentences, such as “sent the boys 
outside” and “authorities said”. In addition, the 
VP “killing five girls” of the original sentence 
with ID 64 is also excluded, since it is largely 
redundant with the summary sentence with ID 1. 

5 Related Work 

Existing multi-document summarization (MDS) 
works can be classified into three categories: 
extraction-based approaches, compression-based 
approaches, and abstraction-based approaches. 

Extraction-based approaches are the most studied. 
Early works adopted a greedy strategy in sentence 
selection (Celikyilmaz and Hakkani-Tur, 2011; 
Goldstein et al., 2000; Wan et al., 2007). Each 
sentence in the documents is first assigned a 
salience score. Then, sentence selection is 
performed by greedily selecting the sentence with 
the largest salience score among the remaining 
ones. Redundancy is controlled during the 
selection by penalizing the remaining sentences 
according to their similarity with the selected 
sentences. An obvious drawback of such a greedy 
strategy is that it is easily trapped in local 
optima. Later, unified models were proposed to 
conduct sentence selection and redundancy control 
simultaneously (McDonald, 2007; Filatova and 
Hatzivassiloglou, 2004; Yih et al., 2007; Gillick 
et al., 2007; Lin and Bilmes, 2010; Lin and 
Bilmes, 2012; Sipos et al., 2012). However, 
extraction-based approaches are unable to evaluate 
salience and control redundancy at a granularity 
finer than sentences. Thus, the selected sentences 
may still contain unimportant or redundant 
phrases. 

Compression-based approaches have been 
investigated to alleviate the above limitation. As 
a natural extension of the extractive method, the 
early works adopted a two-step approach (Lin, 
2003; Zajic et al., 2006; Gillick and Favre, 2009). 
The first step selects the sentences, and the 
second step removes the unimportant or redundant 
units from the sentences. Recently, integrated 
models have been proposed that jointly conduct 
sentence extraction and compression (Martins and 
Smith, 2009; Woodsend and Lapata, 2010; Almeida 
and Martins, 2013; Berg-Kirkpatrick et al., 2011; 
Li et al., 2015). Note that our model also jointly 
conducts phrase selection and phrase merging (new 
sentence generation). Nonetheless, compressive 
methods are unable to merge the related facts from 
different sentences. 

On the other hand, abstraction-based approaches 
can generate new sentences based on the facts from 
different source sentences. In addition to the 
previously mentioned sentence fusion work, new 
directions have been explored. Researchers 
developed an information extraction based approach 
that extracts information items (Genest and 
Lapalme, 2011) or abstraction schemes (Genest and 
Lapalme, 2012) as components for generating 
sentences. Summary revision was also investigated 
to improve the quality of automatic summaries by 
rewriting the noun phrases or people references in 
the summaries (Nenkova, 2008; Siddharthan et al., 
2011). Sentence generation with word graphs was 
applied for summarizing customer [...] ([...], 
2010; Mehdad et al., 2014). 

Recently, the factors of information certainty 
and timeline in the MDS task were explored (Ng et 
al., 2014; Wan and Zhang, 2014; Yan et al., 2011). 
Researchers also explored some variants of the 
typical MDS setting, such as query-chain focused 
summarization, which combines aspects of update 
summarization and query-focused summarization 
(Baumel et al., 2014), and hierarchical 
summarization, which scales up MDS to summarize a 
large set of documents (Christensen et al., 2014). 
A data-driven method for mining sentence 
structures from a large news archive was proposed 
and utilized to summarize unseen news events 
(Pighin et al., 2014). Moreover, some works (Liu et 
al., 2012; Kageback et al., 2014; Denil et al., 
2014; Cao et al., 2015) utilized deep learning 
techniques to tackle some summarization tasks. 
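
The greedy sentence selection with a redundancy penalty, described above for the early extraction-based systems, can be illustrated with a minimal sketch. The Jaccard word-overlap similarity, the `penalty` weight, the toy sentences and salience scores, and the function names are all assumptions made for illustration; the cited systems use their own salience and similarity measures.

```python
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Word-overlap similarity between two token sets (an assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_select(sentences: List[str], salience: List[float],
                  k: int, penalty: float = 1.0) -> List[int]:
    """Pick up to k sentence indices greedily by score, penalizing the
    remaining sentences by their similarity to each selected sentence."""
    tokens = [set(s.lower().split()) for s in sentences]
    scores = list(salience)
    remaining = set(range(len(sentences)))
    chosen: List[int] = []
    while remaining and len(chosen) < k:
        # Highest current score wins; ties go to the earlier sentence.
        best = max(remaining, key=lambda i: (scores[i], -i))
        chosen.append(best)
        remaining.remove(best)
        for i in remaining:
            scores[i] -= penalty * jaccard(tokens[i], tokens[best])
    return chosen


sentences = [
    "a gunman shot ten girls at an amish school",
    "a gunman shot ten girls at the school",
    "police could offer no explanation",
]
salience = [0.9, 0.8, 0.5]  # toy scores, not taken from any data set
picked = greedy_select(sentences, salience, k=2)
print(picked)
```

In this toy input the second sentence has the second-highest salience but overlaps heavily with the first, so the penalty pushes it below the third sentence. Each choice is final once made, which is why such a strategy can be trapped in local optima.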


6 Conclusions and Future Work 

We propose an abstractive MDS framework that 
constructs new sentences by exploring more 
fine-grained syntactic units, namely, noun phrases 
and verb phrases. The designed optimization 
framework operates on the summary level so that 
more complementary semantic content units can be 
incorporated. Phrase selection and merging are 
done simultaneously to achieve a global optimum. 
Meanwhile, the constructed sentences should 
satisfy the constraints related to summarization 
requirements, such as NP/VP compatibility. 
Experimental results on the TAC 2011 summarization 
data set show that our framework outperforms the 
top systems in TAC 2011 under the pyramid metric. 
For future work, one aspect is to enhance the 
grammatical quality of the generated new sentences 
and compressed sentences. Another aspect is to 
improve the time efficiency of our framework, 
whose major bottleneck is the time-consuming ILP 
optimization. 